Data Exploration
After cleaning the gathered data, we can begin exploratory data analysis. This analysis is valuable for gaining an understanding of the data and for finding trends, relationships between variables, anomalies or outliers, and other patterns. In most cases, this type of analysis also feeds back into data cleaning, revealing additional measures to apply if necessary. Based on this initial analysis, we can confirm our assumptions and proceed to develop hypotheses to guide the investigation.
As a starting point, it is useful to distinguish two primary types of techniques: graphical and quantitative. Among the graphical techniques, we find univariate visualizations such as box plots, histograms, and line graphs, while for bivariate analysis we find scatter plots and heatmaps, among others. Quantitative techniques require technical knowledge of more complex algorithms. The simplest quantitative techniques are the summary statistics that underlie box plots and distribution visualizations. More advanced methods involve clustering and dimensionality reduction, which will be explained and applied in subsequent sections.
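To make the connection between summary statistics and box plots concrete, the sketch below computes the five-number summary and the usual IQR outlier fences. The price values are hypothetical, purely for illustration; they are not the project's data.

```python
import pandas as pd

# Hypothetical monthly price series (illustrative values only)
prices = pd.Series([12.5, 13.1, 12.8, 14.0, 15.2, 14.7, 16.3, 18.9])

# Five-number summary underlying a box plot: min, Q1, median, Q3, max
summary = prices.describe()
print(summary[["min", "25%", "50%", "75%", "max"]])

# The interquartile range (IQR) defines the usual outlier fences
iqr = summary["75%"] - summary["25%"]
lower, upper = summary["25%"] - 1.5 * iqr, summary["75%"] + 1.5 * iqr
print(prices[(prices < lower) | (prices > upper)])  # candidate outliers
```

The same statistics are what `geom_boxplot()` computes implicitly on the R side.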
In this section, we use both R and Python to develop the visualizations. The libraries loaded and the helper function created for the analysis are listed below.
R Libraries
library(tidyverse)
library(ggplot2)
library(forecast)
library(astsa)
library(xts)
library(tseries)
library(fpp2)
library(fma)
library(lubridate)
library(TSstudio)
library(quantmod)
library(tidyquant)
library(plotly)
library(gridExtra)
library(readxl)
library(zoo)
library(knitr)
library(kableExtra)
library(patchwork)
library(corrplot)
Python Libraries
import pandas as pd
import matplotlib.pyplot as plt
from cleantext import clean
Function generate_word_cloud()
def generate_word_cloud(my_text, title="Word Cloud"):
    from wordcloud import WordCloud, STOPWORDS
    import matplotlib.pyplot as plt

    # Define a function to plot the word cloud
    def plot_cloud(wordcloud, title):
        # Set figure size
        plt.figure(figsize=(30, 20))
        # Display image
        plt.imshow(wordcloud)
        # No axis details
        plt.axis("off")
        # Add title
        plt.title(title, fontsize=100)

    # Generate word cloud
    wordcloud = WordCloud(
        width=2000,
        height=1000,
        random_state=1,
        # background_color='salmon',
        colormap='Pastel1',
        collocations=False,
        stopwords=STOPWORDS
    ).generate(my_text)
    plot_cloud(wordcloud, title)
    plt.show()
Global Lithium Production
In the first visualization, we can observe a map plot created with Tableau Software. Using the controls to change the year displayed, we can observe the evolution of global lithium production and how total production is distributed among the main producing countries. As explained earlier, the data begins in 1995, since large-scale use of this resource is relatively recent.
Global lithium production over the past two decades reveals a notable shift in the distribution among the major producing countries. In 1995, Australia, the United States, and Chile were the prominent producers, each contributing a modest amount of less than 3,500 tons. By 2015, Australia had jumped to about 12,000 tons, Chile had reached nearly 10,000 tons, and Argentina contributed around 3,600 tons. The most recent data, from 2022, shows a significant increase in lithium production, with Australia leading at 61,000 tons, Chile following with about 38,000 tons, and China producing 19,000 tons. This analysis underscores the remarkable growth of the lithium industry, indicating a persistent upward trend in production over the years.
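The growth implied by the figures quoted above can be checked with a quick back-of-the-envelope calculation. A minimal sketch, using only the approximate production figures stated in the text:

```python
# Approximate Australian production figures quoted above (tons)
australia = {2015: 12_000, 2022: 61_000}

def cagr(start, end, years):
    """Compound annual growth rate between two values."""
    return (end / start) ** (1 / years) - 1

# Average annual growth for Australia between 2015 and 2022
growth = cagr(australia[2015], australia[2022], 2022 - 2015)
print(f"{growth:.1%}")  # roughly 26% per year
```

A sustained growth rate of this magnitude is consistent with the "persistent upward trend" described above.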
Lithium Companies
In examining the stock market performance of lithium-producing companies over the past two decades, we focus on time series data. Albemarle Corporation (ALB) stands out as a market participant that has shown consistent growth since 2000. ALB's stock price has exceeded $37 per share since 2015, with a significant peak of over $300 in 2022, followed by post-peak fluctuations around $200. In particular, the pandemic-related slowdown in 2020 temporarily impacted ALB's value. In contrast, Livent Corporation (LTHM) and Sigma Lithium Corporation (SGML), which entered the market in 2018, show increasing values approaching $30 per share. The comparison highlights ALB's significant market presence and robust performance, with a marked difference in valuation trends compared to its newer competitors.
Visualization
# Read cleaned file
df_companies <- read.csv('../../data/01-modified-data/clean_lithium-companies.csv')
df_companies_1 <- df_companies %>% dplyr::select('date', 'ALB', 'LTHM', 'SGML')
df_companies_1 <- gather(df_companies_1, key = "stock", value = "price", -date)
# Change data type
df_companies_1$date <- as.Date(df_companies_1$date)
# Create ggplot line plot
viz_companies_1 <- ggplot(df_companies_1, aes(x = date, y = price, color = stock)) +
geom_line() +
labs(title = "Stock Prices - Lithium Production Companies",
x = "Date",
y = "Stock Price") +
scale_color_discrete(name = "Stock")
# Show plot
viz_companies_1 %>% ggplotly()
We now look at the stock market performance of the major electric vehicle manufacturers over the past two decades. In the early 2000s, AEHR, ON, and F were the major players. Around 2010, Tesla came onto the scene and changed the landscape. It wasn't until around 2019 that other companies emerged. Notably, in 2020, Tesla experienced a significant peak in the stock market, with prices spiking above $200 and reaching a remarkable $400. This spike and the subsequent volatility positioned Tesla as a market leader relative to its competitors, demonstrating its dominance in the electric vehicle manufacturing sector.
Visualization
# Select electric vehicle manufacturers
df_companies_2 <- df_companies %>% dplyr::select('date', 'TSLA', 'F', 'LI', 'ON', 'RIVN', 'XPEV', 'LVWR', 'AEHR')
df_companies_2 <- gather(df_companies_2, key = "stock", value = "price", -date)
# Change data type
df_companies_2$date <- as.Date(df_companies_2$date)
# Create ggplot line plot
viz_companies_2 <- ggplot(df_companies_2, aes(x = date, y = price, color = stock)) +
geom_line() +
labs(title = "Stock Prices - Electric Vehicles Production Companies",
x = "Date",
y = "Stock Price") +
scale_color_discrete(name = "Stock")
# Show plot
viz_companies_2 %>% ggplotly()
In analyzing the stock market performance of lithium battery manufacturing companies over the past two decades, we focus on Panasonic Holdings Corporation of Japan and BYD Co., Ltd. of China. Panasonic has traded since 2000, with initial values near $2,000 per share. However, it has experienced considerable volatility, dropping below $1,000 per share and currently hovering around $1,200 per share, indicating a challenging period for the company. In contrast, BYD Co., Ltd. entered the market in 2009 and has shown a remarkable upward trend, exceeding $20 per share. Despite this positive trajectory, BYD's market presence remains comparatively smaller than Panasonic's. This comparison highlights the contrasting fortunes of these two companies in the lithium battery manufacturing sector.
Visualization
# Select battery manufacturers
df_companies_3 <- df_companies %>% dplyr::select('date', 'BYDDF', 'X6752.T')
df_companies_3 <- gather(df_companies_3, key = "stock", value = "price", -date)
# Change data type
df_companies_3$date <- as.Date(df_companies_3$date)
# Create ggplot line plot
viz_companies_3 <- ggplot(df_companies_3, aes(x = date, y = price, color = stock)) +
geom_line() +
labs(title = "Stock Prices - Lithium Batteries Production Companies",
x = "Date",
y = "Stock Price") +
scale_color_discrete(name = "Stock")
# Show plot
viz_companies_3 %>% ggplotly()
Resources Prices
In this analysis, we created a correlation heatmap using the Pearson correlation coefficient to visually represent the relationships between the prices of these resources. The results show that lithium has a high correlation with zinc, nickel, and aluminum, while its correlation with copper and cobalt is less pronounced. This observation offers valuable insight into the interdependencies among these critical resources and provides a basis for understanding their co-variation patterns in the context of lithium battery production.
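As a reminder of what each cell in the heatmap encodes, the Pearson coefficient can be computed directly. A minimal sketch with two hypothetical price series (illustrative values, not the project's data):

```python
import numpy as np

# Hypothetical weekly prices for two metals (illustrative values only)
lithium = np.array([10.0, 11.2, 12.1, 13.5, 14.8, 15.0])
zinc    = np.array([ 2.1,  2.3,  2.2,  2.6,  2.9,  3.0])

# Pearson r via NumPy's correlation matrix (off-diagonal entry)
r = np.corrcoef(lithium, zinc)[0, 1]
print(round(r, 2))

# Equivalent manual computation: covariance scaled by both standard deviations
cov = np.mean((lithium - lithium.mean()) * (zinc - zinc.mean()))
r_manual = cov / (lithium.std() * zinc.std())
```

R's `cor(..., use = "complete.obs")`, used below, computes the same coefficient pairwise over the complete observations.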
Visualization
# Read csv file
df_resources <- read.csv("../../data/01-modified-data/clean_resources-price.csv")
# Edit datatypes
df_resources$DATE <- as.Date(df_resources$DATE)
# Create correlation matrix
correlation_matrix <- cor(df_resources[, -1], use = "complete.obs")
correlation_matrix <- reshape2::melt(correlation_matrix)
ggplot(correlation_matrix, aes(x = Var1, y = Var2, fill = value, label = sprintf("%.2f", value))) +
geom_tile(color = "white") +
geom_text(size = 3, color = "white") +
scale_fill_gradient2(low = "blue", mid = "yellow", high = "red", midpoint = 0) +
theme_minimal() +
labs(title = "Resources Correlation Plot")
Electric Vehicles
To assess the relationships between the different characteristics of electric vehicles, we calculated the Pearson correlation coefficient for each pair of variables and visualized the values with a heatmap. The results show that some variables are positively correlated, some inversely correlated, and some only weakly correlated. The most relevant cases are price, which is highly correlated with top speed and battery capacity, and battery capacity, which is also highly correlated with range. Acceleration is strongly inversely correlated with top speed, range, and fast charge.
Visualization
# Read cleaned file
df_vehicles <- read.csv('../../data/01-modified-data/clean_vehicles.csv')
df_vehicles <- df_vehicles %>% dplyr::select('Battery', 'Efficiency', 'FastCharge', 'Price', 'Range', 'TopSpeed', 'Acceleration')
# Create correlation matrix
correlation_matrix2 <- cor(df_vehicles, use = "complete.obs")
correlation_matrix2 <- reshape2::melt(correlation_matrix2)
ggplot(correlation_matrix2, aes(x = Var1, y = Var2, fill = value, label = sprintf("%.2f", value))) +
geom_tile(color = "white") +
geom_text(size = 3, color = "white") +
scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
theme_minimal() +
labs(title = "Electric Vehicles Characteristics Correlation Plot")
Lithium News Sentiment Analysis
Using the sentiment analysis labels assigned to lithium-related news articles, word clouds were created for each sentiment category. The most recurring words across sentiments included “new,” “electric,” “vehicle,” “car,” and “batteries.” Notably, the consistency of the most common words across sentiments suggests a commonality in the language used, regardless of the sentiment assigned to the news article.
WordCloud - Positive Sentiment
Visualization
# Read cleaned file and keep articles labeled positive
df_sentiment = pd.read_csv('../../data/01-modified-data/clean_sentiment_analysis.csv')
df_sentiment = df_sentiment[df_sentiment['ibm_label'] == "positive"]
complete_text = ""
for i in df_sentiment['ibm_content']:
    text = i
    if not pd.isna(text):
        text = clean(text,
                     fix_unicode=True,          # fix various unicode errors
                     to_ascii=True,             # transliterate to closest ASCII representation
                     lower=True,                # lowercase text
                     no_line_breaks=False,      # fully strip line breaks as opposed to only normalizing them
                     no_urls=True,              # replace all URLs with a special token
                     no_emails=True,            # replace all email addresses with a special token
                     no_phone_numbers=True,     # replace all phone numbers with a special token
                     no_numbers=True,           # replace all numbers with a special token
                     no_digits=True,            # replace all digits with a special token
                     no_currency_symbols=True,  # replace all currency symbols with a special token
                     no_punct=False,            # remove punctuation
                     replace_with_punct="",     # instead of removing punctuation you may replace it
                     replace_with_url="",
                     replace_with_email="",
                     replace_with_phone_number="",
                     replace_with_number="",
                     replace_with_digit="0",
                     replace_with_currency_symbol="",
                     lang="en"                  # set to 'de' for German special handling
                     )
        text = text.replace('... [ chars]', ' ')
        text = text.replace('<ul>', ' ')
        text = text.replace('<li>', ' ')
        complete_text = complete_text + text
generate_word_cloud(complete_text, "WordCloud - Positive Sentiment\n")
WordCloud - Neutral Sentiment
Visualization
# Read cleaned file and keep articles labeled neutral
df_sentiment = pd.read_csv('../../data/01-modified-data/clean_sentiment_analysis.csv')
df_sentiment = df_sentiment[df_sentiment['ibm_label'] == "neutral"]
complete_text = ""
for i in df_sentiment['ibm_content']:
    text = i
    if not pd.isna(text):
        text = clean(text,
                     fix_unicode=True,          # fix various unicode errors
                     to_ascii=True,             # transliterate to closest ASCII representation
                     lower=True,                # lowercase text
                     no_line_breaks=False,      # fully strip line breaks as opposed to only normalizing them
                     no_urls=True,              # replace all URLs with a special token
                     no_emails=True,            # replace all email addresses with a special token
                     no_phone_numbers=True,     # replace all phone numbers with a special token
                     no_numbers=True,           # replace all numbers with a special token
                     no_digits=True,            # replace all digits with a special token
                     no_currency_symbols=True,  # replace all currency symbols with a special token
                     no_punct=False,            # remove punctuation
                     replace_with_punct="",     # instead of removing punctuation you may replace it
                     replace_with_url="",
                     replace_with_email="",
                     replace_with_phone_number="",
                     replace_with_number="",
                     replace_with_digit="0",
                     replace_with_currency_symbol="",
                     lang="en"                  # set to 'de' for German special handling
                     )
        text = text.replace('... [ chars]', ' ')
        text = text.replace('<ul>', ' ')
        text = text.replace('<li>', ' ')
        complete_text = complete_text + text
generate_word_cloud(complete_text, "WordCloud - Neutral Sentiment\n")
WordCloud - Negative Sentiment
Visualization
# Read cleaned file and keep articles labeled negative
df_sentiment = pd.read_csv('../../data/01-modified-data/clean_sentiment_analysis.csv')
df_sentiment = df_sentiment[df_sentiment['ibm_label'] == "negative"]
complete_text = ""
for i in df_sentiment['ibm_content']:
    text = i
    if not pd.isna(text):
        text = clean(text,
                     fix_unicode=True,          # fix various unicode errors
                     to_ascii=True,             # transliterate to closest ASCII representation
                     lower=True,                # lowercase text
                     no_line_breaks=False,      # fully strip line breaks as opposed to only normalizing them
                     no_urls=True,              # replace all URLs with a special token
                     no_emails=True,            # replace all email addresses with a special token
                     no_phone_numbers=True,     # replace all phone numbers with a special token
                     no_numbers=True,           # replace all numbers with a special token
                     no_digits=True,            # replace all digits with a special token
                     no_currency_symbols=True,  # replace all currency symbols with a special token
                     no_punct=False,            # remove punctuation
                     replace_with_punct="",     # instead of removing punctuation you may replace it
                     replace_with_url="",
                     replace_with_email="",
                     replace_with_phone_number="",
                     replace_with_number="",
                     replace_with_digit="0",
                     replace_with_currency_symbol="",
                     lang="en"                  # set to 'de' for German special handling
                     )
        text = text.replace('... [ chars]', ' ')
        text = text.replace('<ul>', ' ')
        text = text.replace('<li>', ' ')
        complete_text = complete_text + text
generate_word_cloud(complete_text, "WordCloud - Negative Sentiment\n")